arXiv:2110.01615v1 [cs.DL] 3 Oct 2021


Geography of Science: Competitiveness and Inequality 


Aurelio Patelli, Lorenzo Napolitano, Giulio Cimini, and Andrea Gabrielli


a Enrico Fermi Center for Study and Research, via Panisperna 89a, 00184, Rome (Italy)
b European Commission, Joint Research Center (JRC), 41092 Seville (Spain)
c Physics Department and INFN, University of Rome Tor Vergata, 00133 Rome (Italy)
d Engineering Department, University “Roma Tre”, 00146 Rome (Italy)
e Institute for Complex Systems (CNR), UoS Sapienza, 00185 Rome (Italy)


Abstract 


Using ideas and tools of complexity science we 
design a holistic measure of Scientific Fitness, 
encompassing the scientific knowledge, capabil- 
ities and competitiveness of a research system. 
We characterize the temporal dynamics of Sci- 
entific Fitness and R&D expenditures at the 
geographical scale of nations, highlighting pat- 
terns of similar research systems, and showing 
how developing nations (China in particular) are 
quickly catching up with the developed ones.
Downscaling the aggregation level of the analysis, we
find that even developed nations show a consid- 
erable level of inequality in the Scientific Fitness 
of their internal regions. Further, we assess com- 
paratively how the competitiveness of each geo- 
graphic region is distributed over the spectrum of 
research sectors. Overall, the Scientific Fitness 
represents the first high quality estimation of the 
scientific strength of nations and regions, open- 
ing new policy-making applications for better al- 
locating resources, filling inequality gaps and ul- 
timately promoting innovation. 


1 Introduction 


Science is based on the progressive augmentation
of existing knowledge building on past
discoveries, through a recursive process involving
empirical observation and the formulation of
testable hypotheses. Similarly to what happens for
technological innovation and economic growth
[B], scientific progress requires appropriate
capabilities: previous knowledge, tools, human
capital, resources, and so on. The combination
and interaction of such capabilities, even from
different contexts, pushes the boundary of
science through new knowledge and discoveries, as
well as through re-discoveries via previously
uncharted paths [4][5][6]. This process naturally
occurs mostly in geographic areas where many
different capabilities are concentrated [7], whence
we can assume that the scientific output of a
region reflects the set of relevant capabilities
available.


The quantitative evaluation of scientific out- 
comes, from the microscopic level of individual 
researchers and institutions to the macroscopic 
case of entire nations, is nowadays a common 
practice [8] [9]. At the macro level, a seminal 
work by May [10] assessed the performance of
national research systems using an index bor- 
rowed from the economic literature: the Re- 
vealed Comparative Advantage (RCA) [L], com- 
puted on the number of scientific documents pro- 
duced by each nation in the various research sec- 
tors. King pursued a different approach, 
ranking nations according to the share of global 
citations received by their document output, and 
introduced funding as an additional variable of 
the analysis. Subsequently, the use of citations
became the gold standard for assessing research
quality, and several metrics with varying
performance have been proposed (see for a
comprehensive review of the field). However, this
approach has recently been questioned, due for
example to the very different amount of resources
that nations invest in scientific research. In fact,
even for the most economically developed na- 
tions, the scientific success measured on citations 
and the public spending in research and develop- 
ment (R&D) (as well as returns to inno- 
vation [15]) are correlated but also present strong 
deviations, and therefore should be considered as 
complementary dimensions for a correct evalua- 
tion of scientific performance. Another impor- 
tant problem is given by the presence of bias and 
distortions in citation patterns [7]. Indeed 
the dynamics of the citation process strongly de- 
pend on sector-specific characteristics, and cita- 
tion statistics are often distorted by the pres- 
ence of outliers (the few documents attracting a 
huge amount of citations) [19]. These and 
other issues may reduce the explanatory power 
of citation-based metrics, as well as their variants 
based on top-percentage citations [20], including 
the H-index [21]. 


There are two additional key aspects that cita- 
tion share metrics do not take into account. On 
the macroscopic scale, nations do not specialize 
in a few research sectors but tend to diversify 
their activity into as many sectors as possible. 
This is explained by the capability scheme, for 
which a given geographic area is active in all re- 
search sectors allowed by the capabilities that 
are present on its territory. Since capabilities 
are heterogeneously distributed, nations have a 
heterogeneous level of diversification, thus diver- 
sification itself can be used as a basic proxy of 
scientific performance. In addition, while na- 
tions with many different capabilities (typically, 
the developed economies) are competitive in al- 
most all existing research sectors, nations with 
fewer capabilities (the less developed economies) 
perform well only in a few research areas with 
a lower degree of sophistication or complexity. 
Such a nested structure, induced by the
capability scheme, indicates the presence of a
competitive mechanism shaping the connections among
the scientific actors, akin to what is observed
in natural ecosystems as well as in human
productive activities [I]. Indeed, although the
scientific environment is neither directly nor indi- 
rectly aimed at the production of physical goods 
or services (for which there is a clear payoff) 
and is not subject to the incentives of compet- 
itive markets, there are actually many sources 
of competition, since most research systems rely 
on merit-based processes to determine funding, 
hiring, careers, and thus indirectly scientific
research itself. Therefore, science can only
naively be considered as guided by non-competitive
actors who collaborate for the advancement of
knowledge.


Overall, the nested pattern observed when 
comparing national research systems sug- 
gests that diversification and composition of the 
scientific research basket can be used to measure 
the scientific competitiveness (or Fitness) of a 
nation; at the same time, the complexity of a re- 
search sector depends on its ubiquity and on the 
Scientific Fitness of nations that are competitive 
in that sector. The Economic Fitness and Com- 
plexity (EFC) algorithm is the ideal 
tool to estimate the fixed point of this circular 
relation. The purpose of this work is precisely 
to develop a framework for quantifying scientific 
competitiveness by leveraging the EFC toolbox. 
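
The circular relation described above can be illustrated with a minimal sketch of the Fitness-Complexity iteration as it is commonly formulated in the EFC literature: the Fitness of an area is the sum of the complexities of the sectors it is competitive in, while the complexity of a sector is limited by the least fit areas active in it. The function name, the number of iterations and the tolerance are illustrative assumptions, and the input is assumed to be a binary area-by-sector matrix with no empty rows or columns; the implementation actually used in this work may differ in its details.

```python
import numpy as np

def fitness_complexity(M, n_iter=200, tol=1e-9):
    """Iterate the Fitness-Complexity map on a binary matrix M.

    M[g, s] = 1 if area g is competitive in sector s, 0 otherwise.
    Assumption: every row and every column has at least one 1.
    """
    M = np.asarray(M, dtype=float)
    F = np.ones(M.shape[0])  # Fitness of areas
    Q = np.ones(M.shape[1])  # Complexity of sectors
    for _ in range(n_iter):
        # F_g = sum_s M_gs * Q_s
        F_new = M @ Q
        # Q_s = 1 / sum_g (M_gs / F_g), using the previous Fitness values
        Q_new = 1.0 / (M.T @ (1.0 / F))
        # Normalise both vectors to unit mean at each step
        F_new /= F_new.mean()
        Q_new /= Q_new.mean()
        converged = np.max(np.abs(F_new - F)) < tol
        F, Q = F_new, Q_new
        if converged:
            break
    return F, Q

# Toy nested matrix: area 0 is active everywhere, area 2 only in sector 0.
M_toy = [[1, 1, 1],
         [1, 1, 0],
         [1, 0, 0]]
F_toy, Q_toy = fitness_complexity(M_toy, n_iter=50)
```

In this toy example the most diversified area obtains the highest Fitness, and the sector reached only by that area obtains the highest complexity, which is the nested behaviour the text refers to.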


In a nutshell, we build an appropriate 
database for our analysis starting from the Open 
Academic Graph (OAG) [27] [28] [29] [30], a freely
accessible collection of information about indi- 
vidual scientific publications, covering a large 
portion of the scientific production corpus. On 
the one hand, OAG assigns documents to geo- 
graphic areas according to the location of the 
research institutes to which the authors are af- 
filiated. On the other hand, OAG assigns docu- 
ments to research sectors according to a hier- 
archical classification of scientific topics, each 
known as Field of Study (FoS). The documents
produced by a geographical area in a research
sector provide a basic measure of scientific
performance through an appropriate count
of citations received. In this analysis we can
use a variable resolution both in terms of geo- 
graphical scale (we follow the Territorial Level 
scheme implemented by the OECD [31]) and of 
FoS hierarchical level. Filtering this data
using the RCA yields the scientific bipartite
network (SBN hereafter) connecting geographic
areas with the research sectors they are
competitive in, from which we finally compute the
Scientific Fitness of such areas through the EFC
algorithm [23]. Note that our approach follows
the path initially outlined by May [10], though 
we compute RCA not on document production 
but on citation counts, in accordance with the 
ideas proposed by King — and likewise we 
complement the analysis with data about mone- 
tary resources invested in scientific research. In 
particular we use Higher Education expenditures 
on Research & Development (HERD), again pro- 
vided by the OECD [32]. We refer the reader to 
the Materials and Methods section for a more 
detailed description of the workflow. 
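
As a hedged illustration of the RCA filtering step, the sketch below computes the Revealed Comparative Advantage in its standard (Balassa) form on a table of citation weights and binarizes it. The function names and the threshold value of 1 are assumptions made for the example; the thresholding actually adopted in this work is specified in the Materials and Methods and may differ.

```python
import numpy as np

def rca_matrix(W):
    """Revealed Comparative Advantage of each (area, sector) cell.

    W[g, s] is the (fractional) citation count of area g in sector s.
    RCA_gs = (W_gs / sum_s W_gs) / (sum_g W_gs / sum_gs W_gs)
    """
    W = np.asarray(W, dtype=float)
    area_tot = W.sum(axis=1, keepdims=True)    # total per area
    sector_tot = W.sum(axis=0, keepdims=True)  # total per sector
    expected = area_tot * sector_tot / W.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        # Cells with zero expected weight get RCA = 0
        return np.where(expected > 0, W / expected, 0.0)

def binarize(rca, threshold=1.0):
    """Binary bipartite matrix: 1 where RCA reaches the threshold (assumed 1)."""
    return (np.asarray(rca) >= threshold).astype(int)

# Two areas, two sectors: each area is specialised in a different sector.
W_toy = [[8, 2],
         [2, 8]]
M_toy = binarize(rca_matrix(W_toy))  # [[1, 0], [0, 1]]
```

The resulting binary matrix plays the role of the SBN: rows are geographic areas, columns are research sectors, and an entry equal to 1 marks a sector in which the area is competitive.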


2 Results 


We start by discussing Scientific Fitness at the 
geographic scale of nations — corresponding to 
Territorial Level 1 (TL1) of the OECD classi- 
fication. Unsurprisingly the geographical distri- 
bution of Fitness values, reported in the top map 
of Figure 1, shows that the most developed and
rich nations are also the top performers in sci- 
ence, while the developing nations are ranked 
lower [23]. Such heterogeneous patterns are simi- 
lar to those associated with traditional measures 
of economic size or relevance (such as GDP or 
population) but have a more intensive charac- 
ter, since small countries can display a high Sci- 
entific Fitness while large ones may not, e.g. 
Switzerland ranks higher than India. A higher
correlation is observed with the Economic
Fitness computed using export data,
which is also aimed at measuring competitiveness
based on owned capabilities, though there is not a
one-to-one correspondence between the two
measures (see SI for a further comparison).
Notably, the ranking of Scientific Fitness is also
different from that obtained using metrics based
on citation shares, such as the Mean Normalized
Citation Score (MNCS) [8] [3], which measure
research efficiency rather than competitiveness.
Indeed, MNCS ranks at the top the small but
efficient research systems, such as Switzerland,
Israel and Singapore. Instead, Scientific Fitness
accounts both for efficiency (through the use of
the RCA filter) and diversification (i.e., the
cumulative stock of capabilities owned by a
nation), and thus allows for a fairer comparison
between small and large research systems.
Remarkably the same patterns are observed also 
when the analysis is performed using a different 
dataset (we report in the Supporting Informa- 
tion the case of Scimago [34], based on Scopus). 


Following previous literature on Science of
Science [13], we obtain a richer picture
by complementing Scientific Fitness with the
amount of resources that are invested in scientific
research. A similar approach (with some caveats
discussed below) is also used in the classic EFC
literature, where Economic Fitness is plotted
against a monetary measure of income (typically
the Gross Domestic Product per capita);
the dynamics in the two-dimensional space
defined by these variables highlight clusters of
similar economies, allowing for very precise
economic forecasting [35]. As already mentioned,
here we employ Higher Education expenditures
on R&D (HERD) [32], namely the expenditures
for basic research performed in the higher
education sector, which among the sources of
public funding are those most connected to
scientific performance as measured through citations
of published documents¹. This data is available
only for OECD members and a few other
important economies (such as China and Russia);
therefore the following analysis will be
focused only on this subset of high and middle
income countries. Panel (a) of Figure 2 shows
the trajectories of these nations in the
two-dimensional plane defined by Scientific Fitness and
HERD per capita (HERD-pc). We observe that
the most developed economies usually concentrate
in the top right corner of the diagram
(enlarged in the inset) characterized by high values
of both Fitness and HERD-pc. The other nations
are instead scattered along the diagonal,
for which Scientific Fitness is proportional to
resources invested, and their trajectories are
typically directed towards the top-right region: these
countries are quickly catching up with the most
advanced ones. Off-diagonal trajectories provide
interesting information, similar to that obtained
within the economic framework [25]. The top
left corner contains small national research
systems with peculiar features, where investments
are not efficiently turned into scientific
competitiveness. This is for instance the case of Iceland,
which does not attract much attention in terms
of citations, and of Luxembourg, where the
presence of several private firms' headquarters may bias
the scientific production to patent-related
documents [15]. In the opposite corner, China (and to
a lesser extent Russia, South Africa and Mexico)
features a high scientific competitiveness despite
low public R&D expenditures, with both
quantities growing quickly over time.

¹ The other sources of public funding are [32]: the
Business Expenditures on R&D (BERD), namely R&D
expenditures performed in the business sector, which is
mostly related to the creation of new products and
production techniques (patents); the Government
Intramural Expenditures on R&D (GOVERD), namely
expenditures in the government sector, which is often
mission-oriented and therefore less connected to publication
outputs (see and the discussion therein). In the Supporting
Information we show results of analysis performed
using Gross Expenditure on R&D (GERD), given by the
sum of HERD, BERD and GOVERD.

Figure 1: Map of the Scientific Fitness of nations (TL1, panel a) and of regions (TL2) within North
America (panel b), Europe and Turkey (panel c) and China, Japan and South Korea (panel d).
The color scale indicates the average Fitness between 1998 and 2018 (missing entries are colored
in gray), with darker and lighter tone for higher and lower Fitness, respectively (the scale [0, 1] is
the same for the national and regional levels). Notice how the Fitness of a nation cannot be simply
obtained by summing nor averaging the Fitness of its regions (see Figure 3 below). The map uses
the Robinson projection (esri:54030).

A main advantage of our framework is the
possibility to perform the analysis of Scientific
Fitness at a more detailed geographic level, in
order to highlight the competitiveness of
specific regions within nations. The bottom maps
of Figure 1 report the Scientific Fitness of
regions (as defined by Territorial Level 2 (TL2)
of the OECD classification) for three
macro-areas: North America, Europe and East Asia.
We observe a recurrent pattern in which the


Fitness of a nation is mostly concentrated in 
its capital region (also because capitals typically 
host the headquarters of the largest national
research institutions). The English-speaking
nations (United States and United Kingdom, and
likewise Australia) are the exception,
featuring high Fitness in all their regions.
Such widespread competitiveness may also be
due to the language bias of the dataset, which
covers non-English literature only partially,
especially for Social Sciences and Humanities
(see Materials and Methods and further
analysis in the Supporting Information), and possibly
to the advantage of native English speakers in
writing scientific articles, which may therefore
attract more citations. The evolution of
Scientific Fitness and HERD-pc at the regional level
is shown in panel (b) of Figure 2. We
see again that while most of the North
American regions are top performers, Fitness
values of European regions form a cloud ranging
from low to high competitiveness. The case of
China stands alone: only three provinces (Bei- 
jing, Hong Kong and Tianjin) belong to the cloud 
of EU regions, while the others follow a very reg- 
ular flow with a steady increase both in compet- 
itiveness and public expenditures. Indeed China 
invested enormously in science starting from the 
end of the last century, with growing expendi- 
tures in R&D throughout the country. Apart 
from the three outliers, the competitiveness of 
Chinese provinces has not yet reached that of 
the regions of Western countries, but it will
eventually do so [88][89]. This can be clearly seen in
panel (a) of Figure 3, where the trajectories of regional
Scientific Fitness are plotted against those of
document Fitness, i.e., competitiveness computed
on document production rather than citations
accrued (see the Materials and Methods).
Mainland Chinese provinces follow a unique pattern.
Their document Fitness has increased
substantially in the considered time span (2000-2018),
due to growing resources and the consequent
acquisition of new capabilities. However, initially
this research output was not able to capture
many citations from the international scientific
community, likely due to a low initial level of
competitiveness. Only recently has Chinese research
become very competitive and started to attract
citations, with a consequent growth in Scientific
Fitness: the trajectories of Chinese provinces are
quickly moving towards the main cluster where
the regions of other countries are located.

Figure 2: (panel a) Trajectories of nations (TL1) in the plane defined by Scientific Fitness and
resources invested, the latter measured by Higher Education expenditures on R&D per capita
(HERD-pc). Line colors are used to group nations into macro-areas: dark blue for west EU nations
(plus Switzerland, Israel, Norway, Iceland), light blue for east EU nations, green for the
English-speaking nations (United States, United Kingdom, Canada, Australia, New Zealand), red for China,
yellow for the other Asian nations (Singapore, South Korea, Japan) and purple for middle-income
countries (Russia, South Africa, Mexico, Argentina, Chile). Trajectories represent data from 2000 to
2017, with the arrow indicating the direction of time. The inset zooms on the top-right corner where
there is a concentration of highly competitive nations. (panel b) Trajectories are also displayed
for regions (TL2) belonging to China and a selection of EU west, EU east and North America
nations. (panel c) Cross-correlation between Scientific Fitness and HERD at the national scale
(TL1) averaged over the whole set of countries as a function of the temporal delay (Δ year) used
to compute these quantities. The blue contour represents the 25-75% quantile range, generated with a
bootstrapping technique. Note that a cross-correlation value of about 0.5 is comparable to analogous
estimations carried out in the economics context.

Figure 3: (panel a) Comparison of Scientific Fitness and document Fitness (i.e., Fitness computed
using published documents) at the regional level (TL2). The black lines indicate the density level
contour of the cloud of points while each red trajectory indicates the evolution of a Chinese province.
The trajectories map the evolution from 2000 to 2018, with the arrow indicating the direction of
time. (panel b) Comparison between the Scientific Fitness of nations, computed either at the
national level (TL1) or as the mean (red line) or maximum (blue line) of the Fitness of internal
regions (TL2). (panel c) Gini coefficients of each nation, computed over the citation counts of
internal regions. We report values for two years: 1995 (full color bars) and 2015 (shaded color bars).
Nations are ordered according to their average Scientific Fitness in the central decade (2000-2010).
The inset represents the temporal evolution of the Gini coefficient of the whole world.


Overall the results of the analysis at TL2 indi- 
cate that the Fitness of a nation is not obtained 
by simply averaging or summing up the Fitness 
of its regions, because the most exclusive capa- 
bilities are typically concentrated only in a few 
regions, which thus determine the national
competitiveness. This is confirmed by the plot in
panel (b) of Figure 3, which shows that the
national Fitness is more correlated to the Fitness
of its most competitive region (Pearson correla- 
tion of about 0.96) rather than to the mean Fit- 
ness of its regions (Pearson correlation of about 
0.70). More importantly, our framework high- 
lights the strong heterogeneity of Fitness values 
both across and within nations, and thus allows 
locating the geographical inequalities of the
scientific research system. Values of the Gini
coefficients (see Materials and Methods for the precise
mathematical definition of the Gini coefficients
implemented) are shown in the bottom panel of
Figure 3 for the available nations with more than
4 TL2 regions, and for two reference years (1995
and 2015) spaced by two decades. The analysis
shows that the United Kingdom and Australia
have the lowest inequality scores and in general
the English-speaking nations feature low
inequalities, while mid-income countries are
characterized by the highest inequality levels. We
also compute the global Gini coefficient over all
available regions in the world; the inset in
panel (c) of Figure 3 shows that the global level of
inequality is slowly decreasing over time.
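
For illustration only, the following sketch computes a Gini coefficient over the citation counts of the regions of a nation using the textbook rank-based formula. This is an assumption: the precise definition implemented in this work is the one given in the Materials and Methods and may differ (for example in its normalization). The function name and the toy values are hypothetical.

```python
import numpy as np

def gini(values):
    """Gini coefficient of non-negative values (textbook rank-based formula).

    G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i), with x sorted ascending
    and i = 1..n. Returns 0 for empty or all-zero input.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float((2 * ranks - n - 1) @ x / (n * x.sum()))

# Hypothetical regional citation counts for two nations
even_nation = [5, 5, 5, 5]    # perfectly even: Gini = 0
capital_only = [0, 0, 0, 1]   # all citations in one region: Gini = (n - 1) / n = 0.75
g_even, g_capital = gini(even_nation), gini(capital_only)
```

With this definition a nation whose citations are spread evenly over its regions scores 0, while a nation whose citations all come from a single region approaches 1 as the number of regions grows.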


Down-scaling the analysis from nations (TL1)
to regions (TL2) means increasing the
geographical resolution of our method. Similarly, we can
increase the resolution regarding the research
sectors, by exploiting the hierarchical
classification of FoS. Thus, for example, instead of
computing the total Scientific Fitness of a geographic
area we can compute its sector Fitness restricted
to one of the 19 entries of the FoS main
hierarchical level. Figure 4 shows the radar plots
of the sector Fitness for some example regions.
The 19 research sectors are ordered clockwise in
the radar according to their complexity
(computed as the average complexity of their
sub-sectors), so that Business is the most complex
and Material Science is the least complex FoS.
Note that the EFC algorithm typically assigns
higher complexity to soft sciences (Economics,
Social Sciences and Humanities) rather than to
medical and hard sciences, because it turns out
that only the most competitive players are
active in the former sectors, while the latter
sectors are more ubiquitous. This pattern can be
partially due to the aforementioned bias of our
bibliometric data towards English-speaking
nations in soft sciences. However, a more
fundamental explanation exists: only the most developed
research systems have reached the level of
capabilities required to perform scientific research
in, e.g., Business Administration,
Environmental Ethics and Cognitive Science. These sectors
require solid prerequisites in the hard sciences,
but they are not necessarily related to high
technological requirements²; rather they are aimed
at addressing the most advanced needs of a
society [23]. Overall, the analysis of the scientific
sector Fitness allows one to quantitatively detect
the strengths and weaknesses of each region, as well
as their temporal evolution. For instance,
Figure 4 shows how the Beijing region experienced
a fast growth in competitiveness in the hard
sciences while it still lags behind western regions
in artistic and cultural areas.
Regions like Zurich, Lazio and Alberta have instead
a more uniform pattern of competitiveness,
especially in the last decades. Note how the
top-performing regions like New South Wales can
also have competitive gaps but only in the less
complex sectors.

² Note that the average complexity of a research sector
does not fully reflect the complexity of the associated
sub-sectors. Indeed, also in the hard sciences there are highly
sophisticated research sectors that require expensive
instruments and large collaborations. For example, while
the average complexity of Business is 1.82, the
complexity of Polymer science, a child code of Material science
and Chemistry, is as high as 5.34.

Figure 4: Radar plots of the scientific sector Fitness of different sample regions (TL2). Top row:
Nevada (USA), Alberta (Canada), New South Wales (Australia). Central row: Zurich (Switzerland),
Lazio (Italy), Rhineland-Palatinate (Germany). Bottom row: Beijing (China), Hong Kong (China),
Rio de Janeiro (Brazil). Sectors are ordered clockwise with decreasing average complexity (Business
is the most complex and Material Science is the least complex sector). The radar lines indicate how
Fitness has evolved over the course of thirty years, from 1985 to 2015.


3 Discussions and Conclusion


This work aims to bring together two recent lines 
of research: Science of Science [8] [40] [9], which
develops quantitative methods and assessment 
tools to study the evolution of science itself, and 
Economic Fitness and Complexity [47], 
which aims at measuring the productive capa- 
bilities of economic systems. Indeed, our frame- 
work to assess competitiveness in scientific re- 
search builds on the theory of hidden capabilities 
and employs properly calibrated bibliometric
indicators. The proposed methods allow for a
consistent comparison between different
geographical areas and research sectors at varying levels
of resolution. In this work we presented only a
handful of applications, highlighting the hetero- 
geneity of scientific competitiveness among na- 
tions as well as the inequalities within national 
research systems. We further characterized the 
performance of scientific actors across the vari- 
ous research sectors, and showed that the evo- 
lution of research systems can be properly de- 
scribed using two dimensions, Scientific Fitness 
and R&D expenditure. In the plane defined by 
these variables, nations form clusters of similar 
research systems operating within countries that 
have reached comparable stages of development. 

Similarly to the classic applications in
the EFC literature, this study shows that a high
explanatory and forecasting power is achieved
when Fitness is coupled with a
variable related to the amount of resources
available in the system under enquiry. Typically,
the EFC literature proxies resource endowments
with Gross Domestic Product (GDP) [35]; for
our purposes, HERD is the more appropriate
measure. However, there is a fundamental
difference between the use of GDP and HERD. GDP is
a measure of generated capital and wealth, hence
it reflects the outcome of the production
process; for this reason GDP can be interpreted as
a consequence of Economic Fitness. Instead,
HERD measures the amount of public resources
that are fed into the scientific system and thus
is an input requisite for Scientific Fitness.
Consequently, while both the trajectories of
countries in the GDP-Fitness plane and the
trajectories in the Scientific Fitness-HERD plane allow
one to extract interesting patterns concerning the way
in which nations cluster in the plane, there are
also remarkable differences in their
interpretation. For instance, there is evidence suggesting
that one can qualitatively predict 5-year trends
in GDP in the light of the historical evolution of
Economic Fitness [35]. However, it would be
overconfident to push the analogy between Economic
Fitness and Scientific Fitness to the point of
trying to infer future Scientific Fitness from
historical HERD data. A comprehensive analysis of
the relation between Scientific Fitness and
different measures of input and output of research
systems represents a promising avenue for future
research.


In addition to uncovering non-trivial patterns
in the evolution of national and regional
knowledge production systems, the application of the
EFC methodology to the realm of scientific
production data also has potential relevance for
policy making. Even though the direct concern
of economic policy is not so much knowledge
creation, but rather economic output or
innovation, it is known that competitiveness in
scientific fields is robustly linked to the development
of competitive advantages in patenting as well as
export [3]. Since success in one of the above three
layers (knowledge, innovation, trade) tends to
be a precursor of success in the others, it is
reasonable to argue that a long-sighted approach to
growth and development policies can only
benefit from factoring knowledge production
capabilities into the equation. Finally, the analysis
of the scientific competitiveness of regional areas
adds a tool for the analysis of local capabilities,
necessary for the development of less wealthy
regions.


4 Materials and Methods


We extract the scientific database from
the Open Academic Graph (OAG)

https://www.microsoft.com/en-us/research/project/open-academic-graph/

a freely available snapshot of a two
billion-scale academic graph resulting from the
unification of Microsoft Academic Graph and
AMiner [30]. We use OAG v2, created
at the end of November 2018. The database
is composed of a list of entries related to
various scientific literature: journal articles,
books, conference proceedings, reviews, and
others. The OAG coverage is estimated to be
comparable to that of Scopus or WoS [42], thus
likely presenting similar geographical and
linguistic biases, in particular the partial coverage
of non-English written literature, especially
in the Social Sciences and Humanities where
research output is often published in the native
language [37]. The OAG data spans more than
a century, starting in principle at the beginning
of the 1800s. In practice, data before the Second
World War presents large fluctuations mainly
due to the scarce amount of scientific production
for most of the regions. Hence, we start the
analysis in 1960, although the core results are
presented only for the recent decades where also
expenditure data is available (see below).

The classification of research sectors is defined
by the Fields of Study (FoS), features which are
dynamically evaluated by an “in-house
knowledge base related entity relationship, which is
calculated based on the entity contents, hyperlinks
and web-click signals” [28]. The FoS are mostly
organized into a hierarchical structure, with the
main characteristic that a code may have more
than one parent³. This structure presents a static
layer 0 with 19 hand-defined codes, corresponding
to the main classification of the research sectors.
Moving deeper in the hierarchy, layer 1 presents 294
codes while layer 2 has more than 80 000 codes,
and this number may change in time when new
FoS are generated⁴.

³ The very few exceptions of codes that are labelled
at a fine level but without information on their parents
are removed from the analysis. This does not represent a
problem, since we consider only the highest levels of the
FoS hierarchy.

The OAG database is used to construct the
bipartite network linking geographical areas to
research sectors. To this end we select only the
OAG entries with full information on authors’
affiliation, FoS, citation count and year of
publication. Using this data we build tables reporting,
for each year, the number of scientific documents
produced by each geographical area in the various
FoS, and the citations they received up to the
OAG creation date. In order to assign a document
to a geographic area, OAG uses the location of
the authors’ main affiliation. Note that in
some cases it is not possible to select a precise
location because the affiliation may refer generically
to a multinational firm or a multi-location
research council (such as CNRS in France or
CNR in Italy). In these cases the location of the
headquarters is used, although this process may
artificially boost capital regions. Note also that
several documents are labeled by multiple
FoS and/or have several author affiliations. In
these cases we employ a fractional counting
approach, assigning the document to FoS and
geographic areas with a weight that is inversely
proportional to the number of FoS and the number
of authors.⁵ Fractional counting has the main
advantage that it allows aggregating tables on both
the geographical and FoS dimensions without
disproportionately increasing the weight of the
most productive actors. Additionally, fractional
counting is to be preferred as it better balances
the scientific outputs of large and small
geographical regions [43].
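The fractional counting rule can be illustrated with a minimal Python sketch. The function name `fractional_weights` and its arguments are hypothetical, and the choice of assigning to each (area, FoS) cell the product of the area weight and the FoS weight is an assumption made here for illustration, not a detail stated in the text.

```python
from collections import Counter
from fractions import Fraction


def fractional_weights(fos, author_areas):
    """Split one document across (area, FoS) cells by fractional counting.

    fos          : list of distinct FoS labels attached to the document
    author_areas : list with one geographic area per author
    Assumption: the weight of a cell is (share of authors in the area)
    times (1 / number of FoS), so that all weights sum to 1.
    """
    fos_weight = Fraction(1, len(fos))
    n_authors = len(author_areas)
    weights = {}
    for area, n_in_area in Counter(author_areas).items():
        for s in fos:
            weights[(area, s)] = Fraction(n_in_area, n_authors) * fos_weight
    return weights


# A paper with two FoS and three authors, two in area g1 and one in g2.
w = fractional_weights(["s1", "s2"], ["g1", "g1", "g2"])
for cell, weight in sorted(w.items()):
    print(cell, weight)
```

Summing the weights over FoS gives the area weights 2/3 and 1/3, and summing over areas gives 1/2 for each FoS. Under this scheme the citations of a paper would be multiplied by the same weights before being added to the tables.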

⁴Deeper layers 3, 4 and 5 mostly split the larger topics,
but are not considered in the present work.

⁵For example, if a paper is labeled with FoS s1 and s2
and has three authors, the first two affiliated with
(possibly different) institutions in area g1 and the third
with an institution in area g2, the paper is assigned to
FoS s1 and s2 with the same weight 1/2, while it is
assigned to geographic areas g1 and g2 with weights 2/3
and 1/3, respectively. The paper’s citations are split
according to the same ratios.

Following the classical approach of the
Scientometrics literature, we use citations received
by scientific documents as a reliable proxy for
the quality of research [8]. However, the simple
citation count presents a few drawbacks,
especially when used to assess a small corpus of
papers. This is due to the time papers need to
reach a stable level of citations [44], to the high
skewness of the citation distribution for single
papers [46], and to the dependence of citation
patterns on the specific sector and journal
considered. Indeed the dynamical process
underlying the evolution of citation counts is well
modeled using a preferential attachment
process [17] [48] [49]. This means that the sum of the
citations accrued by a set of papers is dominated
by the citations of the few most cited outliers,
which are in turn subject to strong statistical
fluctuations (especially in small sets). A simple
yet effective approach to reduce such fluctuations
as well as the skewness of the citation
distribution consists in using a logarithmic
transformation [51]. Thus we employ the log-citations
count

$$ W_{gs} = \log(1 + c_{gs}) \qquad (1) $$

where g labels a geographical area and s a
research sector, while $c_{gs}$ is the citation count of
documents assigned to area g for FoS s published
in a given year.

We further filter the log-citations counts to
build a Scientific Bipartite Network (SBN),
relating for each year the geographical areas on
one set with the research sectors in which they
are competitive on the other set. To this end
we use an index borrowed from the economics
literature, the Revealed Comparative Advantage
(RCA) [O], which measures competitiveness as
the ability of an actor to perform an activity
more than a reference level, the latter given by
the global average performance of the selected
activity. Applied to our case study, a geographical
area is considered competitive in a research
sector if its RCA is above a threshold, typically
set to 1. In formula:

$$ \mathrm{RCA}_{gs} = \frac{W_{gs} \,/\, \sum_{s'} W_{gs'}}{\sum_{g'} W_{g's} \,/\, \sum_{g's'} W_{g's'}} \qquad (2) $$

We thus build the SBN using the binary filter
$M_{gs} = 1$ if $\mathrm{RCA}_{gs} > 1$ and $M_{gs} = 0$ otherwise.
Note that before implementing the filter we
apply an exponential smoothing to the RCA series,
considering a half-life of 3 years in order to keep
a short persistence in the data.
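The chain from citation counts to the binary SBN can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the smoothing recursion (a first-order exponential filter whose weight halves every `half_life` years) is one common reading of "exponential smoothing with a half-life of 3 years", assumed here rather than taken from the text.

```python
import numpy as np


def log_citations(c):
    """Log-citations count W_gs = log(1 + c_gs) for a (areas x sectors) array."""
    return np.log1p(np.asarray(c, dtype=float))


def rca(w):
    """Revealed Comparative Advantage: share of sector s within area g,
    divided by the global share of sector s."""
    w = np.asarray(w, dtype=float)
    row = w.sum(axis=1, keepdims=True)   # total of each area
    col = w.sum(axis=0, keepdims=True)   # total of each sector
    with np.errstate(divide="ignore", invalid="ignore"):
        out = (w / row) / (col / w.sum())
    # Empty rows or columns give undefined ratios; set them to 0 (assumption).
    return np.nan_to_num(out, nan=0.0, posinf=0.0, neginf=0.0)


def smooth(series, half_life=3.0):
    """Exponential smoothing along axis 0 (years); assumed recursion
    s_t = d * s_{t-1} + (1 - d) * x_t with d = 0.5 ** (1 / half_life)."""
    series = np.asarray(series, dtype=float)
    decay = 0.5 ** (1.0 / half_life)
    out = np.empty_like(series)
    out[0] = series[0]
    for t in range(1, len(series)):
        out[t] = decay * out[t - 1] + (1.0 - decay) * series[t]
    return out


def binarize(rca_values, threshold=1.0):
    """Binary filter: 1 where RCA exceeds the threshold, 0 otherwise."""
    return (np.asarray(rca_values) > threshold).astype(int)


# Toy example: two areas, two sectors, one year.
w = log_citations([[10, 0], [10, 10]])
print(binarize(rca(w)))
```

In a multi-year setting one would stack the yearly RCA matrices along the first axis, pass the stack to `smooth`, and binarize each smoothed year separately.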

Finally, we feed the SBN to the Fitness and
Complexity algorithm [52]. The method
exploits the nested structure of the network and
obtains the Fitness $F_g$, or competitiveness, of a
geographic area g by aggregating the complexities
of its basket of research sectors in a non-linear
way (so that the most complex sectors of activity
weigh the most); in the same way, the
complexity $Q_s$ of a research sector s is given
by the Fitness of the geographic areas that are
active in it (with the least competitive regions
weighing the most). Operationally, the Fitness and
the complexity vectors are the fixed point of the
following non-linear iterative map

$$ \tilde{F}_g^{(n)} = \sum_s M_{gs}\, Q_s^{(n-1)}, \qquad F_g^{(n)} = \frac{\tilde{F}_g^{(n)}}{\langle \tilde{F}_g^{(n)} \rangle_g}, $$

$$ \tilde{Q}_s^{(n)} = \frac{1}{\sum_g M_{gs}\, \frac{1}{F_g^{(n-1)}}}, \qquad Q_s^{(n)} = \frac{\tilde{Q}_s^{(n)}}{\langle \tilde{Q}_s^{(n)} \rangle_s}, $$

where the operator $\langle \cdot \rangle_x$ indicates the arithmetic
mean with respect to the possible values assumed
by the variable dependent on the set x. Fixed
point values of the Fitness are finally normalized
by a reference value, which is taken to be the
Fitness of the United States at TL1 and that of
California at TL2 (US06). The normalization
aims to regularize the heterogeneous
distribution of Fitness across the years, emphasizing
the relative strength of the nations rather than
a global competitiveness. Note that we build two
kinds of Fitness indicators: the Scientific Fitness,
based on the log-citations counts of eq. (1), and the
document Fitness, for which log-documents counts
are used in their place.
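A minimal sketch of the Fitness and Complexity iteration is given below, assuming a binary matrix in which every area is active in at least one sector and every sector has at least one active area. The function name, the uniform initial condition and the fixed number of iterations are illustrative assumptions; a production implementation would also need a convergence criterion.

```python
import numpy as np


def fitness_complexity(m, n_iter=200):
    """Iterate the non-linear Fitness-Complexity map on a binary matrix.

    m : (areas x sectors) array with entries 0 or 1.
    Returns the Fitness of the areas and the complexity of the sectors,
    each normalized to unit mean at every step.
    """
    m = np.asarray(m, dtype=float)
    f = np.ones(m.shape[0])   # assumed initial condition: all ones
    q = np.ones(m.shape[1])
    for _ in range(n_iter):
        # Both updates use the values from the previous iteration.
        f_tilde = m @ q                      # sum of complexities of active sectors
        q_tilde = 1.0 / (m.T @ (1.0 / f))    # dominated by the least fit areas
        f = f_tilde / f_tilde.mean()
        q = q_tilde / q_tilde.mean()
    return f, q


# Perfectly nested toy matrix: area 0 is active everywhere, area 2 in one sector.
f, q = fitness_complexity([[1, 1, 1], [1, 1, 0], [1, 0, 0]], n_iter=50)
print(f, q)
```

In this toy case the most diversified area obtains the highest Fitness and the sector reached only by that area obtains the highest complexity. Dividing `f` by the entry of a chosen reference area would give the reference normalization described in the text.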

We quantify the degree of scientific inequality
within a nation using the Gini coefficient
estimated on the dispersion of citation counts among
its regions (in the Supporting Information
we consider a version of the Gini coefficient that
takes population size into account). For our
purposes, the Gini index can be written as follows:

$$ G = 1 - \frac{\sum_{i=1}^{n} f(y_i)\,(S_{i-1} + S_i)}{S_n} $$

where $S_i = \sum_{j=1}^{i} f(y_j)\, y_j$, $S_0 = 0$, $f(y_i)$ is the
fraction of regions within the same country that
have received at least $y_i$ citations, and $y_i < y_j$
whenever $i < j$.
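As a hedged illustration, the Gini index in this cumulative form can be computed as follows. The sketch takes f(y_i) to be the fraction of regions whose citation count equals y_i, which is the usual reading of this discrete formula and an assumption here; the function name `gini` is hypothetical and the input is assumed to be non-empty.

```python
import numpy as np


def gini(values):
    """Gini index G = 1 - sum_i f(y_i) (S_{i-1} + S_i) / S_n.

    values : citation counts of the regions of one country (non-empty).
    f(y_i) is taken as the fraction of regions with value y_i (assumption).
    """
    values = np.asarray(values, dtype=float)
    # np.unique returns the distinct values in increasing order.
    y, counts = np.unique(values, return_counts=True)
    f = counts / counts.sum()
    s = np.cumsum(f * y)                       # S_1, ..., S_n
    if s[-1] == 0:
        return 0.0                             # no citations at all: no inequality
    s_prev = np.concatenate(([0.0], s[:-1]))   # S_0, ..., S_{n-1}
    return float(1.0 - np.sum(f * (s_prev + s)) / s[-1])


print(gini([5, 5, 5, 5]))   # identical regions
print(gini([0, 0, 0, 1]))   # all citations in one region out of four
```

With identical regions the index vanishes, while concentrating all citations in one of four regions gives 3/4, the maximum attainable with four regions.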

We remark that the OAG database allows
obtaining the SBN at different levels of
geographic aggregation, ranging from the
fine-grained description of individual institutions to
the macroscale of regions and nations. In this
work we focus on the macroscopic scale, in
order to compare with previous literature of EFC
and Science-of-Science. Leveraging the OECD
Territorial Level Classification, we generate
the SBN both at the Territorial Level 1 (TL1)
of nations (207 countries, following the
present-day world structure) and at the
Territorial Level 2 (TL2), which includes 577 distinct
regions⁶ in 43 countries (some of which are not
OECD members).

⁶There are in principle more than 700 regions, but for
some of them no affiliation is found.

The expenditure database is based on the
available data collected by the OECD on the
Gross Expenditure in Research and Developments
(GERD) indices [82]. The database covers
48 countries, i.e. all the OECD members and
a few other relevant nations for which the data is
made available, such as China and Russia. However,
the data quality depends strongly on national
features, and the HERD database used
in the analysis above is available
for 42 nations (among the OECD members, only
Colombia does not provide information on the
expenditure). We implement a linear interpolation
to reconstruct the missing points. At TL2, the
database follows the same classification
implemented for the derivation of the territorial level
SBN, edited by the OECD. However, the
reconstruction at the lower scales is interpolated
keeping constant the national performances, since the
data presents more than 50% of missing entries.
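The linear interpolation of missing yearly values can be sketched with `numpy.interp`. The function name is hypothetical, missing points are assumed to be encoded as NaN, years are assumed to be sorted in increasing order, and at least one value is assumed to be known. Note that `numpy.interp` holds the first and last known values constant outside the observed range, which may differ from the treatment adopted in the analysis.

```python
import numpy as np


def fill_missing_linear(years, values):
    """Fill NaN entries of a yearly series by linear interpolation in time."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    known = ~np.isnan(values)
    filled = values.copy()
    # Interpolate only at the missing years, using the known points as nodes.
    filled[~known] = np.interp(years[~known], years[known], values[known])
    return filled


series = fill_missing_linear(
    [2000, 2001, 2002, 2003, 2004],
    [1.0, np.nan, 3.0, np.nan, np.nan],
)
print(series)
```

The gap in 2001 is filled with the midpoint of its neighbours, while the trailing gaps take the last observed value.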


Data Availability 


The datasets generated and analysed during
the current study are available in the Scientific
database repository, https://efcdata.


A Supporting Information Appendix (SI)


A.1 Coverage Soft Science / Hard 
Science 


The Open Academic Graph (OAG) does not
provide information on the uniformity of the
coverage of the different sectors and nations. Indeed, a
problem faced by other databases, such as
SCOPUS, is that English-speaking and developed
nations have a full coverage of the literature
produced in all the scientific domains, from hard
sciences to soft and social sciences, while the rest of
the nations may have only a partial coverage,
especially in the soft sciences, which are mostly
written in national languages. This bias is not
addressed by OAG, but it can be estimated by
computing the ratio between the scientific
production of each nation in hard and soft sciences.
Defining soft sciences as the set of FoS
descending from (Sociology, Political Science, Art,
Business, Philosophy, History) and hard sciences as all
the others, OAG presents a language bias that can
be visualized in figure [5].

Figure 5: World map in which the color scale corresponds to the ratio of each nation's overall
production in soft sciences to that in hard sciences.


A.2 Scientific Fitness — Economic 
Fitness 


The Scientific Fitness is a measure of the
competitiveness of the national and regional research
systems, as discussed in the main text. The
scientific competitiveness depends on the
competitiveness in other sectors of innovation, such
as that measured by the Economic Fitness [24] [33],
since there is a strong interaction among them.
However, there is no one-to-one relation among the
different kinds of competitiveness, as shown in
figure [6]. Indeed, high Economic Fitness usually
translates to high Scientific Fitness, while the
converse is not observed.


A.3 GERD versus Scientific Fit- 
ness 


The scientific expenditure, collected by the
OECD, is aggregated in the Gross Expenditure
in Research and Developments (GERD),
available for 42 nations. The database can be
decomposed into the Governmental expenditure (GOVERD),
the Business part (BERD) and the Higher
Education part (HERD). Although HERD
correlates well with scientific success, it is
possible to derive the same qualitative information
considering the gross expenditure, where similar
research systems cluster in the
expenditure-Fitness diagram. Figure [7] indicates that the
trajectories of the developed nations follow the ones
shown in the main text. Remarkably, the largest
difference with respect to the HERD diagram lies in
the position of China. Indeed, the gap that China’s
HERD has with respect to the developed nations
can be partially explained by its higher amount
of GOVERD expenditures.


A.4 Inequality metrics with the in- 
formation of the population 
size 


The inequality implemented in the main text 
is computed following the procedure discussed 
in . However, the computation of the inequality
does not account for the different population
density and the heterogeneity naturally present
across countries. Remarkably, the OECD collects
data on the number of permanent researchers,
which may be a good estimate of
the population size in the case of the scientific
inequality. Figure [8] shows the bars of the Gini
index as it is implemented in the main text and
a weighted version, where the weights are
proportional to the sizes of the researchers’ populations.

The difference between the measures appears
relevant only in some developed nations, while it
does not modify the global picture.

Figure 6: Trajectories of the nations in the diagram scattering the Scientific Fitness and the
Economic Fitness.

Figure 7: Trajectories of nations (TL1, left panel) in the plane defined by Scientific Fitness and
resources invested, the latter measured by Gross Expenditures on R&D per capita (GERD-pc). Line
colors are used to group nations into macro-areas: dark blue for west EU nations (plus Switzerland,
Israel, Norway, Iceland), light blue for east EU nations, red for China, purple for middle-income
countries (Russia, South Africa, Mexico, Argentina, Chile), green for the English-speaking
nations (United States, United Kingdom, Canada, Australia, New Zealand), and yellow for the
Asian nations (Singapore, South Korea, Japan). Trajectories represent data from 2000 to 2017,
with the arrow indicating the direction of time. The inset zooms on the top-right corner where
there is a concentration of highly competitive nations. Trajectories are also displayed for regions
(TL2, right panel) belonging to China and a selection of EU west, EU east and North America
nations. Finally, the central panel at the bottom displays the cross-correlation between Scientific
Fitness and GERD at the national scale (TL1), averaged over the whole set of countries, as a function
of the temporal delay (Δ year) used to compute these quantities. The blue contour represents the
25-75% quantile range, generated with a bootstrapping technique.


A.5 Comparison of OAG with 
Scimago 


A second source of data available for the
construction of the Scientific Bipartite Network is
the database offered by Scimago [34]. This
database is aggregated by Scimagolab using the
data available from the SCOPUS database,
collected by Elsevier. The Scimago database
offers the matrices of scientific performance
(citation counts and document production, among
others) at the national level and implements
full counting statistics: each document
contributes a unit value to every nation that has at
least one affiliation among the authors. Thus
each national value corresponds to the number of
papers produced by researchers operating from
that nation, independently of the collaboration
size.
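For comparison with fractional counting, the full counting rule described above can be sketched as follows. The function name and the input format (one list of nations per document, with one entry per author affiliation) are assumptions made for illustration and do not reflect the actual Scimago data format.

```python
from collections import Counter


def full_counting(documents):
    """Count documents per nation with full counting.

    documents : iterable of lists of nations, one entry per author affiliation.
    Each document adds 1 to every nation appearing at least once,
    regardless of how many authors or nations are involved.
    """
    counts = Counter()
    for nations in documents:
        for nation in set(nations):   # a nation is counted once per document
            counts[nation] += 1
    return counts


docs = [["IT", "IT", "US"], ["US"], ["FR", "IT"]]
counts = full_counting(docs)
print(dict(counts))
print(sum(counts.values()), "credits for", len(docs), "documents")
```

The total number of credits exceeds the number of documents whenever papers are co-authored across nations, which is why full counting tends to favour actors involved in large collaborations.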

The scientific classification implemented in
the database is the All Science Journal
Classification (ASJC) [55], which gives 327 codes at the
finest level and is based on the classification
of the journals in which the scientific documents
appear. Thus, the classes do not depend directly on
the content of the paper but rather on the topic
of the journal, reducing the precision of the
analysis based on capabilities.

Despite the aforementioned differences with
respect to the OAG database used in
the main manuscript, the Scientific Fitness
computed on the Scimago database does not
differ from the one obtained from OAG, except in
the subset of the English-speaking nations.
Indeed, the English-speaking nations, and primarily
the USA, outperform the other nations in
competitiveness, collecting most of the global
Fitness. Figure [9] shows the scatter plot of the
non-normalized Fitness using the SBN based on
Scimago and on OAG; the English-speaking
nations (green dots) are the only set not lying in the
main cluster of points. Removing these outliers,
there is a very good correlation between the
Fitness based on Scimago and that based on OAG.
Remarkably, the language bias found in OAG is less
dominant than the bias of SCOPUS in the
computation of the Scientific Fitness.